<!DOCTYPE html>
<html>
<head><meta name="generator" content="Hexo 3.9.0">
  <meta charset="utf-8">
  
  <title>【spark】pyspark之rdd | MaxMa</title>
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
  
  <meta name="keywords" content="sparkRDDpyecharts正则表达式">
  
  
  
  
  <meta name="description" content="首先，先给出用Spark（RDD）进行数据分析的思维导图，如下图： 1 思维导图 2 正则表达式从原始文本数据中，按照一定的规则形式，提取出我们想要的数据字段，正则表达式很关键咯！！下面是正则表达式通配符表： 验证正则表达式写的对不对的小工具网页：https://regexr.com/     表达式 含义 表达式 含义     [abcd] 匹配候选集 (re)su(lt) 分组，group按照">
<meta name="keywords" content="spark,RDD,pyecharts,正则表达式">
<meta property="og:type" content="article">
<meta property="og:title" content="【Spark】PySpark之RDD">
<meta property="og:url" content="https://anxiang1836.github.io/2019/08/11/Spark_RDD/index.html">
<meta property="og:site_name" content="MaxMa">
<meta property="og:description" content="首先，先给出用Spark（RDD）进行数据分析的思维导图，如下图： 1 思维导图 2 正则表达式从原始文本数据中，按照一定的规则形式，提取出我们想要的数据字段，正则表达式很关键咯！！下面是正则表达式通配符表： 验证正则表达式写的对不对的小工具网页：https://regexr.com/     表达式 含义 表达式 含义     [abcd] 匹配候选集 (re)su(lt) 分组，group按照">
<meta property="og:locale" content="zh-CN">
<meta property="og:image" content="https://raw.githubusercontent.com/anxiang1836/FigureBed/master/img/20190821222536.png">
<meta property="og:updated_time" content="2019-11-05T17:32:02.099Z">
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="【Spark】PySpark之RDD">
<meta name="twitter:description" content="首先，先给出用Spark（RDD）进行数据分析的思维导图，如下图： 1 思维导图 2 正则表达式从原始文本数据中，按照一定的规则形式，提取出我们想要的数据字段，正则表达式很关键咯！！下面是正则表达式通配符表： 验证正则表达式写的对不对的小工具网页：https://regexr.com/     表达式 含义 表达式 含义     [abcd] 匹配候选集 (re)su(lt) 分组，group按照">
<meta name="twitter:image" content="https://raw.githubusercontent.com/anxiang1836/FigureBed/master/img/20190821222536.png">
  
    <link rel="alternate" href="/atom.xml" title="MaxMa" type="application/atom+xml">
  

  

  <link rel="icon" href="/css/images/mylogo.jpg">
  <link rel="apple-touch-icon" href="/css/images/mylogo.jpg">
  
    <link href="//fonts.googleapis.com/css?family=Source+Code+Pro" rel="stylesheet" type="text/css">
  
  <link href="https://fonts.googleapis.com/css?family=Open+Sans|Montserrat:700" rel="stylesheet" type="text/css">
  <link href="https://fonts.googleapis.com/css?family=Roboto:400,300,300italic,400italic" rel="stylesheet" type="text/css">
  <link href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" rel="stylesheet">
  <style type="text/css">
    @font-face{font-family:futura-pt; src:url("css/fonts/FuturaPTBold.otf") format("woff");font-weight:500;font-style:normal;}
    @font-face{font-family:futura-pt-light; src:url("css/fonts/FuturaPTBook.otf") format("woff");font-weight:lighter;font-style:normal;}
    @font-face{font-family:futura-pt-italic; src:url("css/fonts/FuturaPTBookOblique.otf") format("woff");font-weight:400;font-style:italic;}
}

  </style>
  <link rel="stylesheet" href="/css/style.css">

  <script src="/js/jquery-3.1.1.min.js"></script>
  <script src="/js/bootstrap.js"></script>

  <!-- Bootstrap core CSS -->
  <link rel="stylesheet" href="/css/bootstrap.css">

  
    <link rel="stylesheet" href="/css/dialog.css">
  

  

  
    <link rel="stylesheet" href="/css/header-post.css">
  

  
  
  

</head>
</html>


  <body data-spy="scroll" data-target="#toc" data-offset="50">


  
  <div id="container">
    <div id="wrap">
      
        <header>

    <div id="allheader" class="navbar navbar-default navbar-static-top" role="navigation">
        <div class="navbar-inner">
          
          <div class="container"> 
            <button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
              <span class="sr-only">Toggle navigation</span>
              <span class="icon-bar"></span>
              <span class="icon-bar"></span>
              <span class="icon-bar"></span>
            </button>

            
              <a class="brand" style="
                 margin-top: 0px;"  
                href="#" data-toggle="modal" data-target="#myModal" >
                  <img width="124px" height="124px" alt="Hike News" src="/css/images/mylogo.jpg">
              </a>
            
            
            <div class="navbar-collapse collapse">
              <ul class="hnav navbar-nav">
                
                  <li> <a class="main-nav-link" href="/">首页</a> </li>
                
                  <li> <a class="main-nav-link" href="/archives">归档</a> </li>
                
                  <li> <a class="main-nav-link" href="/categories">分类</a> </li>
                
                  <li> <a class="main-nav-link" href="/tags">标签</a> </li>
                
                  <li> <a class="main-nav-link" href="/about">关于</a> </li>
                
                  <li><div id="search-form-wrap">

    <form class="search-form">
        <input type="text" class="ins-search-input search-form-input" placeholder="" />
        <button type="submit" class="search-form-submit"></button>
    </form>
    <div class="ins-search">
    <div class="ins-search-mask"></div>
    <div class="ins-search-container">
        <div class="ins-input-wrapper">
            <input type="text" class="ins-search-input" placeholder="请输入关键词..." />
            <span class="ins-close ins-selectable"><i class="fa fa-times-circle"></i></span>
        </div>
        <div class="ins-section-wrapper">
            <div class="ins-section-container"></div>
        </div>
    </div>
</div>
<script>
(function (window) {
    var INSIGHT_CONFIG = {
        TRANSLATION: {
            POSTS: '文章',
            PAGES: '页面',
            CATEGORIES: '分类',
            TAGS: '标签',
            UNTITLED: '(无标题)',
        },
        ROOT_URL: '/',
        CONTENT_URL: '/content.json',
    };
    window.INSIGHT_CONFIG = INSIGHT_CONFIG;
})(window);
</script>
<script src="/js/insight.js"></script>

</div></li>
            </div>
          </div>
                
      </div>
    </div>

</header>



      
            
      <div id="content" class="outer">
        
          <section id="main" style="float:none;"><article id="post-Spark_RDD" style="width: 75%; float:left;" class="article article-type-post" itemscope itemprop="blogPost" >
  <div id="articleInner" class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 class="thumb" class="article-title" itemprop="name">
      【Spark】PySpark之RDD
    </h1>
  

      </header>
    
    <div class="article-meta">
      
	<a href="/2019/08/11/Spark_RDD/" class="article-date">
	  <time datetime="2019-08-11T08:37:03.783Z" itemprop="datePublished">2019-08-11</time>
	</a>

      
    <a class="article-category-link" href="/categories/Spark/">Spark</a>

      
	<a class="article-views">
	<span id="busuanzi_container_page_pv">
		阅读量<span id="busuanzi_value_page_pv"></span>
	</span>
	</a>

      

    </div>
    <div class="article-entry" itemprop="articleBody">
      
        <p>首先，先给出用Spark（RDD）进行数据分析的思维导图，如下图：</p>
<h2 id="1-思维导图"><a href="#1-思维导图" class="headerlink" title="1 思维导图"></a>1 思维导图</h2><p><img src="https://raw.githubusercontent.com/anxiang1836/FigureBed/master/img/20190821222536.png" alt></p>
<h2 id="2-正则表达式"><a href="#2-正则表达式" class="headerlink" title="2 正则表达式"></a>2 正则表达式</h2><p>从原始文本数据中，按照一定的规则形式，提取出我们想要的数据字段，正则表达式很关键咯！！下面是正则表达式通配符表：</p>
<p>验证正则表达式写的对不对的小工具网页：<a href="https://regexr.com/" target="_blank" rel="noopener">https://regexr.com/</a></p>
<div class="table-container">
<table>
<thead>
<tr>
<th>表达式</th>
<th>含义</th>
<th>表达式</th>
<th>含义</th>
</tr>
</thead>
<tbody>
<tr>
<td>[abcd]</td>
<td>匹配候选集</td>
<td>(re)su(lt)</td>
<td>分组，group按照从1编序</td>
</tr>
<tr>
<td>\.</td>
<td>匹配”.”</td>
<td>.</td>
<td>除\n之外所有的字符</td>
</tr>
<tr>
<td>\d</td>
<td>所有数字</td>
<td>\D</td>
<td>所有非数字</td>
</tr>
<tr>
<td>\s</td>
<td>所有空白符</td>
<td>\S</td>
<td>所有非空白符</td>
</tr>
<tr>
<td>\w</td>
<td>A-Z a-z 0-9</td>
<td>\w</td>
<td>除A-Z a-z 0-9之外</td>
</tr>
<tr>
<td>m{3}</td>
<td>匹配3个m</td>
<td>m*</td>
<td>0个或多个</td>
</tr>
<tr>
<td>m+</td>
<td>1个或多个</td>
<td>m?</td>
<td>0个或1个</td>
</tr>
<tr>
<td>^the</td>
<td>匹配开头</td>
<td>$</td>
<td>匹配结尾</td>
</tr>
</tbody>
</table>
</div>
<h2 id="3-读入数据"><a href="#3-读入数据" class="headerlink" title="3 读入数据"></a>3 读入数据</h2><p>调用textFile的API，这里需要划重点的是，需要多次反复用的数据，要记得.cache()进行缓存，从而减少不必要的I/O时间。</p>
<h2 id="4-RDD操作"><a href="#4-RDD操作" class="headerlink" title="4 RDD操作"></a>4 RDD操作</h2><p>这里面需要注意的有一点：那就是Transformation和action的区别：</p>
<ul>
<li>Transformation：这个是惰性计算，当在程序中写完的时候，并不执行；</li>
<li>Action：会激活上面没有完成的计算的Transformation。</li>
</ul>
<p>在思维导图中，列举的大部分常见的操作，如果还有需求的话，那就查API的速查表好了！！</p>
<h2 id="5-数据统计可视化"><a href="#5-数据统计可视化" class="headerlink" title="5 数据统计可视化"></a>5 数据统计可视化</h2><p>这里我补充了一下pyecharts的工具库，可以支持在notebook中，json立可交互图形可视化。</p>
<p>官方GitHub为：<a href="https://github.com/pyecharts/pyecharts" target="_blank" rel="noopener">https://github.com/pyecharts/pyecharts</a></p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install pyecharts -U</span><br></pre></td></tr></table></figure>
<p>如果不显示图的话，那么需要安装html5lib。</p>
<p>pyecharts 的地图库</p>
<p>【Anaconda Prompt命令行】pip install echarts-countries-pypkg</p>
<p>【Anaconda Prompt命令行】pip install echarts-china-provinces-pypkg</p>
<p>【Anaconda Prompt命令行】pip install echarts-china-cities-pypkg</p>
<p>【Anaconda Prompt命令行】pip install echarts-china-counties-pypkg</p>
<p>【Anaconda Prompt命令行】pip install echarts-china-misc-pypkg</p>
<p>【Anaconda Prompt命令行】pip install echarts-united-kingdom-pypkg</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> pyecharts.charts <span class="keyword">import</span> Pie</span><br><span class="line"><span class="keyword">from</span> pyecharts <span class="keyword">import</span> options <span class="keyword">as</span> opts</span><br><span class="line">%config InlineBackend.figure_format = <span class="string">'svg'</span></span><br><span class="line">%matplotlib inline</span><br><span class="line"></span><br><span class="line">labels = [<span class="number">200</span>, <span class="number">500</span>, <span class="number">501</span>, <span class="number">302</span>, <span class="number">403</span>, <span class="number">304</span>, <span class="number">404</span>]</span><br><span class="line">fracs = [<span class="number">0.891149538976086</span>, <span class="number">1.911904537331848e-06</span>, <span class="number">1.720714083598663e-05</span>, <span class="number">0.01685025198901802</span>, </span><br><span class="line">         <span class="number">0.00010897855862791534</span>, <span class="number">0.08548635027620648</span>, <span class="number">0.006385761154688373</span>]</span><br><span class="line"></span><br><span class="line">pie = (</span><br><span class="line">        Pie()</span><br><span class="line">        .add(<span class="string">""</span>, [list(z) <span class="keyword">for</span> z <span class="keyword">in</span> zip(labels, [<span class="string">'%.4f'</span> % (x*<span class="number">100</span>) <span class="keyword">for</span> x <span class="keyword">in</span> fracs])])</span><br><span class="line">        .set_global_opts(title_opts=opts.TitleOpts(title=<span class="string">"返回状态码分析"</span>))</span><br><span class="line">        .set_series_opts(label_opts=opts.LabelOpts(formatter=<span class="string">"【&#123;b&#125;】: &#123;c&#125;%"</span>))</span><br><span class="line">    )</span><br><span class="line"></span><br><span class="line">pie.render_notebook()</span><br></pre></td></tr></table></figure>

      
    </div>
    <footer class="article-footer">
      
      
      
        
	<div id="comment">
		<!-- 来必力City版安装代码 -->
		<div id="lv-container" data-id="city" data-uid="MTAyMC80NTk2OS8yMjQ4MA==">
		<script type="text/javascript">
		   (function(d, s) {
		       var j, e = d.getElementsByTagName(s)[0];

		       if (typeof LivereTower === 'function') { return; }

		       j = d.createElement(s);
		       j.src = 'https://cdn-city.livere.com/js/embed.dist.js';
		       j.async = true;

		       e.parentNode.insertBefore(j, e);
		   })(document, 'script');
		</script>
		<noscript>为正常使用来必力评论功能请激活JavaScript</noscript>
		</div>
		<!-- City版安装代码已完成 -->
	</div>



      
      
        
  <ul class="article-tag-list"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/RDD/">RDD</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/pyecharts/">pyecharts</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/spark/">spark</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/正则表达式/">正则表达式</a></li></ul>

      

    </footer>
  </div>
  
    
<nav id="article-nav">
  
    <a href="/2019/08/11/Spark_DataFrame/" id="article-nav-newer" class="article-nav-link-wrap">
      <strong class="article-nav-caption">上一篇</strong>
      <div class="article-nav-title">
        
          【Spark】PySpark之DataFrame
        
      </div>
    </a>
  
  
</nav>

  
</article>

<!-- Table of Contents -->

  <aside id="toc-sidebar">
    <div id="toc" class="toc-article">
    <strong class="toc-title">文章目录</strong>
    
        <ol class="nav"><li class="nav-item nav-level-2"><a class="nav-link" href="#1-思维导图"><span class="nav-number">1.</span> <span class="nav-text">1 思维导图</span></a></li><li class="nav-item nav-level-2"><a class="nav-link" href="#2-正则表达式"><span class="nav-number">2.</span> <span class="nav-text">2 正则表达式</span></a></li><li class="nav-item nav-level-2"><a class="nav-link" href="#3-读入数据"><span class="nav-number">3.</span> <span class="nav-text">3 读入数据</span></a></li><li class="nav-item nav-level-2"><a class="nav-link" href="#4-RDD操作"><span class="nav-number">4.</span> <span class="nav-text">4 RDD操作</span></a></li><li class="nav-item nav-level-2"><a class="nav-link" href="#5-数据统计可视化"><span class="nav-number">5.</span> <span class="nav-text">5 数据统计可视化</span></a></li></ol>
    
    </div>
  </aside>

</section>
        
      </div>
      
      <footer id="footer">
  

  <div class="container">
      	<div class="row">
	      <p> Powered by <a href="http://hexo.io/" target="_blank">Hexo</a> and <a href="https://github.com/iTimeTraveler/hexo-theme-hiker" target="_blank">Hexo-theme-hiker</a> </p>
	      <p id="copyRightEn">Copyright &copy; 2013 - 2020 MaxMa All Rights Reserved.</p>
	      
	      
    		<p class="busuanzi_uv">
				访客数 : <span id="busuanzi_value_site_uv"></span> |  
				访问量 : <span id="busuanzi_value_site_pv"></span>
		    </p>
  		   
		</div>

		
  </div>
</footer>


<!-- min height -->

<script>
    var wrapdiv = document.getElementById("wrap");
    var contentdiv = document.getElementById("content");
    var allheader = document.getElementById("allheader");

    wrapdiv.style.minHeight = document.body.offsetHeight + "px";
    if (allheader != null) {
      contentdiv.style.minHeight = document.body.offsetHeight - allheader.offsetHeight - document.getElementById("footer").offsetHeight + "px";
    } else {
      contentdiv.style.minHeight = document.body.offsetHeight - document.getElementById("footer").offsetHeight + "px";
    }
</script>
    </div>
    <!-- <nav id="mobile-nav">
  
    <a href="/" class="mobile-nav-link">Home</a>
  
    <a href="/archives" class="mobile-nav-link">Archives</a>
  
    <a href="/categories" class="mobile-nav-link">Categories</a>
  
    <a href="/tags" class="mobile-nav-link">Tags</a>
  
    <a href="/about" class="mobile-nav-link">About</a>
  
</nav> -->
    

<!-- mathjax config similar to math.stackexchange -->

<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    tex2jax: {
      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
      processEscapes: true
    }
  });
</script>

<script type="text/x-mathjax-config">
    MathJax.Hub.Config({
      tex2jax: {
        skipTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
      }
    });
</script>

<script type="text/x-mathjax-config">
    MathJax.Hub.Queue(function() {
        var all = MathJax.Hub.getAllJax(), i;
        for(i=0; i < all.length; i += 1) {
            all[i].SourceElement().parentNode.className += ' has-jax';
        }
    });
</script>

<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>


  <link rel="stylesheet" href="/fancybox/jquery.fancybox.css">
  <script src="/fancybox/jquery.fancybox.pack.js"></script>


<script src="/js/scripts.js"></script>




  <script src="/js/dialog.js"></script>








	<div style="display: none;">
    <script src="https://s95.cnzz.com/z_stat.php?id=1260716016&web_id=1260716016" language="JavaScript"></script>
  </div>



	<script async src="//busuanzi.ibruce.info/busuanzi/2.3/busuanzi.pure.mini.js">
	</script>






  </div>

  <div class="modal fade" id="myModal" tabindex="-1" role="dialog" aria-labelledby="myModalLabel" aria-hidden="true" style="display: none;">
  <div class="modal-dialog">
    <div class="modal-content">
      <div class="modal-header">
        <h2 class="modal-title" id="myModalLabel">设置</h2>
      </div>
      <hr style="margin-top:0px; margin-bottom:0px; width:80%; border-top: 3px solid #000;">
      <hr style="margin-top:2px; margin-bottom:0px; width:80%; border-top: 1px solid #000;">


      <div class="modal-body">
          <div style="margin:6px;">
            <a data-toggle="collapse" data-parent="#accordion" href="#collapseOne" onclick="javascript:setFontSize();" aria-expanded="true" aria-controls="collapseOne">
              正文字号大小
            </a>
          </div>
          <div id="collapseOne" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne">
          <div class="panel-body">
            您已调整页面字体大小
          </div>
        </div>
      


          <div style="margin:6px;">
            <a data-toggle="collapse" data-parent="#accordion" href="#collapseTwo" onclick="javascript:setBackground();" aria-expanded="true" aria-controls="collapseTwo">
              夜间护眼模式
            </a>
        </div>
          <div id="collapseTwo" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingTwo">
          <div class="panel-body">
            夜间模式已经开启，再次单击按钮即可关闭 
          </div>
        </div>

        <div>
            <a data-toggle="collapse" data-parent="#accordion" href="#collapseThree" aria-expanded="true" aria-controls="collapseThree">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;关 于&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</a>
        </div>
         <div id="collapseThree" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingThree">
          <div class="panel-body">
            MaxMa
          </div>
          <div class="panel-body">
            Copyright © 2020 MaxMa All Rights Reserved.
          </div>
        </div>
      </div>


      <hr style="margin-top:0px; margin-bottom:0px; width:80%; border-top: 1px solid #000;">
      <hr style="margin-top:2px; margin-bottom:0px; width:80%; border-top: 3px solid #000;">
      <div class="modal-footer">
        <button type="button" class="close" data-dismiss="modal" aria-label="Close"><span aria-hidden="true">×</span></button>
      </div>
    </div>
  </div>
</div>
  
  <a id="rocket" href="#top" class=""></a>
  <script type="text/javascript" src="/js/totop.js?v=1.0.0" async=""></script>
  
    <a id="menu-switch"><i class="fa fa-bars fa-lg"></i></a>
  
<script type="text/x-mathjax-config">
    MathJax.Hub.Config({
        tex2jax: {
            inlineMath: [ ["$","$"], ["\\(","\\)"] ],
            skipTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code'],
            processEscapes: true
        }
    });
    MathJax.Hub.Queue(function() {
        var all = MathJax.Hub.getAllJax();
        for (var i = 0; i < all.length; ++i)
            all[i].SourceElement().parentNode.className += ' has-jax';
    });
</script>
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
</body>
</html>